27 May, 2021
Data (geo)science project
Exploring your data
Modelling
Wrangling (Load, Tidy and Transform)
Each section:
5–10 minutes of introduction
5–10 minutes of live coding and questions
# Install PAGES from GitHub:
# install.packages("devtools")
devtools::install_github("MartinSchobben/PAGES", build_vignettes = TRUE)
Load PAGES with library().
library(PAGES)
This class is based entirely on R for Data Science (R4DS) by Hadley Wickham and Garrett Grolemund.
I have adapted the examples to cases from geology.
Wickham and Grolemund (2016)
The tidyverse: an opinionated collection of R packages designed for data science.
In stratigraphy we look at variations in the value of a variable through time (or time series).
Time is recorded as height or depth in the section, which can sometimes be calibrated to absolute time (age model).
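As a minimal sketch of what an age model does, the example below converts sample heights to ages by linear interpolation between dated tie points. The tie points and ages are invented for illustration only and do not come from the PAGES datasets; real age models are usually more involved.

```r
# Hypothetical tie points: heights (m) with assumed absolute ages (Ma).
tie_points <- data.frame(
  height = c(0, 5, 10),
  age    = c(201.6, 201.4, 201.0)
)

# Heights of samples for which an age estimate is wanted.
sample_height <- c(2.5, 5, 7.5)

# Linear interpolation between the tie points with base R's approx().
sample_age <- approx(
  x    = tie_points$height,
  y    = tie_points$age,
  xout = sample_height
)$y

data.frame(height = sample_height, age = sample_age)
```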
The R package PAGES contains the lazy-loaded datasets bonenburg (geochemistry) and kuhjoch (palynology). For more information on a dataset, use ? (e.g., ?bonenburg).
Wickham and Grolemund (2016)
RStudio projects
A clear directory and file structure, with metadata to describe the data. Raw data should be read-only and backed up.
An R script with clear documentation of all steps involved.
Publish all aspects of this workflow along with your paper.
Pipe: %>%
kuhjoch_grps <- group_by(kuhjoch_long, type)
kuhjoch_sum <- summarise(kuhjoch_grps, mean(count))
kuhjoch_sum <- summarise(group_by(kuhjoch_long, type), mean(count))
kuhjoch_sum <- group_by(kuhjoch_long, type) %>% summarise(mean(count))
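The three styles above (intermediate object, nested calls, pipe) should give the same result. Here is a small self-contained check, assuming a toy table `toy_long` with invented counts standing in for `kuhjoch_long`:

```r
library(dplyr)

# Toy stand-in for kuhjoch_long (invented values, not the PAGES data).
toy_long <- tibble(
  type  = c("pollen", "pollen", "spores", "spores"),
  count = c(10, 20, 5, 15)
)

# 1. Intermediate object
toy_grps  <- group_by(toy_long, type)
sum_steps <- summarise(toy_grps, mean_count = mean(count))

# 2. Nested calls (read inside-out)
sum_nested <- summarise(group_by(toy_long, type), mean_count = mean(count))

# 3. Pipe (read left-to-right, top-to-bottom)
sum_piped <- toy_long %>%
  group_by(type) %>%
  summarise(mean_count = mean(count))

# Compare the three results
all.equal(sum_steps, sum_nested)
all.equal(sum_steps, sum_piped)
```

The pipe version avoids both the throwaway object and the inside-out reading order, which is why it is used throughout the rest of the examples.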
Wickham and Grolemund (2016)
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>,
orientation = <ORIENTATION>
) +
<FACET_FUNCTION>
Grammar of graphics (Wilkinson et al. 2005)
Scatterplot: maps each observation to a horizontal and vertical position and the geom represents this as a point.
ggplot(data = bonenburg) + geom_point(mapping = aes(x = del13Ctoc, y = height))
Additional variables can be mapped to aesthetics such as colour, shape and linetype. Here I use the stratigraphy (categorical) as an additional variable.
ggplot(data = bonenburg) + geom_point(mapping = aes(x = del13Ctoc, y = height, colour = strat))
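To try the aesthetic mapping without installing PAGES, the sketch below uses a small invented data frame with the same column names as in the slides (`height`, `del13Ctoc`, `strat`). `layer_data()` shows what ggplot2 actually computed for the point layer, including one colour per stratigraphic unit.

```r
library(ggplot2)

# Invented values for illustration only (not the bonenburg data).
toy_section <- data.frame(
  height    = c(1, 2, 3, 4, 5, 6),
  del13Ctoc = c(-27.5, -27.1, -26.8, -28.0, -27.9, -27.3),
  strat     = rep(c("unit A", "unit B"), each = 3)
)

# Position aesthetics (x, y) plus colour mapped to the categorical variable.
p <- ggplot(data = toy_section) +
  geom_point(mapping = aes(x = del13Ctoc, y = height, colour = strat))

# Inspect the computed layer: one row per point, with its assigned colour.
drawn <- layer_data(p, 1)
unique(drawn$colour)
```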
Internal (statistical) transformation of <DATA>.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>,
orientation = <ORIENTATION>
) +
<FACET_FUNCTION>
Typical questions
Wickham and Grolemund (2016)
ggplot(data = bonenburg) + geom_boxplot(mapping = aes(y = strat, x = del13Ctoc), stat = "boxplot")
ggplot(data = bonenburg) + geom_boxplot(mapping = aes(y = reorder(strat, height), x = del13Ctoc))
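The second boxplot call uses `reorder()` so that the units appear in stratigraphic rather than alphabetical order. A minimal sketch of what `reorder()` does, using invented unit names and heights: by default it sorts the factor levels by the mean of the second argument within each group.

```r
# Invented example: three units with their sample heights (m).
toy <- data.frame(
  strat  = c("sandstone", "sandstone", "clay", "clay", "marl", "marl"),
  height = c(1, 2, 5, 6, 9, 10)
)

# Default factor levels are alphabetical.
levels(factor(toy$strat))

# reorder() sorts the levels by mean height per unit (lowest first).
strat_ordered <- reorder(toy$strat, toy$height)
levels(strat_ordered)
```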
Facets split the data into separate panels according to a categorical variable.
ggplot(data = bonenburg_long) + geom_point(mapping = aes(x = value, y = height)) + facet_grid(cols = vars(measurement), scales = "free_x")
Discern patterns (or signals) from noise.
Exploration, not confirmation or formal inference!
Wickham and Grolemund (2016)
ggplot(data = bonenburg_cross, mapping = aes(x = value, y = del13Ctoc)) + geom_point(aes(colour = strat)) + facet_wrap(facets = vars(measurement), scales = "free")
ggplot(data = bonenburg_cross, mapping = aes(x = value, y = del13Ctoc)) + geom_point(aes(colour = strat)) + geom_smooth() + facet_wrap(facets = vars(measurement), scales = "free")
ggplot(bonenburg, aes(x = TOCcfb, y = del13Ctoc)) + geom_point(aes(colour = strat)) + geom_smooth(method = "lm")
lm(del13Ctoc ~ TOCcfb, bonenburg)
ggplot(bonenburg, aes(x = log(TOCcfb), y = del13Ctoc)) + geom_point(aes(colour = strat)) + geom_smooth(method = "lm")
lm(del13Ctoc ~ log(TOCcfb), bonenburg)
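To see why the log transformation can matter, the sketch below fits both model forms to invented data in which `y` depends on `log(x)`, and compares the explained variance. The variable names and numbers are assumptions for illustration and say nothing about the actual bonenburg result.

```r
# Invented data: y is a linear function of log(x) plus a little noise.
x <- c(0.5, 1, 2, 4, 8)
y <- -27 + 0.5 * log(x) + c(0.05, -0.05, 0.02, -0.02, 0)
toy <- data.frame(x = x, y = y)

# Same two model forms as in the slides: untransformed and log-transformed.
fit_lin <- lm(y ~ x, data = toy)
fit_log <- lm(y ~ log(x), data = toy)

# Estimated intercept and slope of the log model.
coef(fit_log)

# Proportion of variance explained by each model.
r2_lin <- summary(fit_lin)$r.squared
r2_log <- summary(fit_log)$r.squared
c(linear = r2_lin, log = r2_log)
```

In keeping with the exploratory spirit of this section, the comparison is a rough guide to which form describes the pattern better, not a formal test.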
This was a very simple, exploratory analysis of the data.
Fitting models:
Further reading:
Wickham and Grolemund (2016)
Load your data into R with the readr package:
- read_csv(): comma separated (CSV) files
- read_tsv(): tab separated files
- read_delim(): general delimited files
Description on website: “In many cases, these functions will just work!”
Conversely, you can write data back to several file formats with the write_*() functions.
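A minimal round trip with readr, assuming an invented two-row table and a temporary file: write it with `write_csv()` and read it back with `read_csv()`. Declaring the column types explicitly is optional, but it documents what you expect to find in the file.

```r
library(readr)

# Invented example table (not the PAGES data).
toy <- data.frame(
  height = c(1.5, 2.5),
  strat  = c("unit A", "unit B")
)

# Write to a temporary CSV file.
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read it back, stating the expected column types.
toy_back <- read_csv(
  path,
  col_types = cols(height = col_double(), strat = col_character())
)

toy_back
```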
PAGES_example()
## [1] "bonenburg_cross.csv" "bonenburg_long.csv"  "bonenburg_raw.csv"
## [4] "bonenburg_tidy.csv"  "bonenburg.csv"       "kuhjoch_long.csv"
## [7] "kuhjoch_raw.csv"     "kuhjoch.csv"
read_csv(PAGES_example("bonenburg_raw.csv"))
## # A tibble: 108 x 13
##    SampleID Height CaCO3    TN del13Ctoc TOCcfb `Al2O3 (%)` `Na2O (%)` `K2O (%)`
##       <dbl>  <dbl> <dbl> <dbl>     <dbl>  <dbl>       <dbl>      <dbl>     <dbl>
##  1        0   3.01 13.3   0.06     -27.5   1.16        15.6       0.62      3.25
##  2       60   3.56  2.67 NA        -25.5   0.27        13.1       1.27      3.33
##  3      100   3.95  3.84  0.07     -27.3   0.96        17.4       0.55      3.54
##  4      150   4.43  5.86  0.07     -27     1.25        17.6       0.44      3.79
##  5      200   4.94 12.8   0.07     -27.8   1.52        16.4       0.48      3.73
##  6      250   5.25  3.34  0.09     -27.6   2.45        14.6       0.61      3.42
##  7      275   5.68  9.91  0.06     -27     1.19        17.3       0.44      4.18
##  8      300   5.92 NA    NA         NA    NA           15.5       0.46      3.74
##  9      300   6.16 22.2   0.06     -27.1   1.21        NA         NA        NA
## 10      350   6.41 20.5   0.06     -27.5   1.14        15.1       0.51      3.92
## # … with 98 more rows, and 4 more variables: Strat <chr>, Strat2 <chr>,
## #   Section <chr>, Reference <chr>
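There are three interrelated rules which make a dataset tidy (Wickham and Grolemund 2016):
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.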
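One common tidying step is reshaping a wide table, with one column per measurement type, into a long table with one row per measurement, which is the shape used for faceting in the exploration section. A minimal sketch with tidyr's `pivot_longer()`, assuming an invented wide table; the output column names `measurement` and `value` follow the ones used in the plotting examples.

```r
library(tidyr)

# Invented wide table: one column per measured quantity.
toy_wide <- data.frame(
  height = c(1, 2),
  CaCO3  = c(13.3, 2.7),
  TOC    = c(1.2, 0.3)
)

# Gather the measurement columns into name/value pairs.
toy_long <- pivot_longer(
  toy_wide,
  cols      = c(CaCO3, TOC),
  names_to  = "measurement",
  values_to = "value"
)

toy_long
```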
Create, rename, reorder and summarise variables with dplyr from the tidyverse.
- mutate(): e.g., calculate K/Al from K and Al
- select(): e.g., pick height and K/Al
- filter(): e.g., keep all observations above 3 meters
- summarise(): e.g., combined with group_by(), calculate the mean value of K/Al for lithological units
- arrange(): e.g., sort by ascending height
- grouping: group_by() or rowwise()
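The verbs listed above chain naturally with the pipe. Below is a minimal sketch on an invented table; the column names `K`, `Al` and `strat` and the height threshold are assumptions chosen for illustration.

```r
library(dplyr)

# Invented example table (not the PAGES data).
toy <- tibble(
  height = c(5, 1, 4, 2),
  strat  = c("unit B", "unit A", "unit B", "unit A"),
  K      = c(6, 2, 4, 3),
  Al     = c(8, 8, 8, 6)
)

result <- toy %>%
  mutate(K_Al = K / Al) %>%          # create the elemental ratio
  select(height, strat, K_Al) %>%    # keep only the columns of interest
  filter(height > 1.5) %>%           # drop the lowest sample
  arrange(height) %>%                # sort by ascending height
  group_by(strat) %>%                # one group per unit
  summarise(mean_K_Al = mean(K_Al))  # mean ratio per unit

result
```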
XRF oxides and normalization with elemental ratios.
mutate(
  bonenburg_tidy,
  # oxide correction: mass fraction of the element in its oxide
  Al_pc = Al2O3_pc * with(marelac::atomicweight, 2 * Al / (2 * Al + 3 * O)),
  Na_pc = Na2O_pc * with(marelac::atomicweight, 2 * Na / (2 * Na + O)),
  K_pc = K2O_pc * with(marelac::atomicweight, 2 * K / (2 * K + O)),
  .keep = "unused"
) %>%
  # normalization with Al and rename
  mutate(
    across(c(Na_pc, K_pc), ~ .x / Al_pc, .names = "{gsub(\"pc\", \"\", .col)}Al"),
    .keep = "unused"
  )
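The oxide correction multiplies an oxide weight percentage by the mass fraction of the element in that oxide, which follows from the stoichiometry (two cations per formula unit in Al2O3, Na2O and K2O). A worked sketch in base R, assuming rounded standard atomic weights typed in by hand rather than taken from marelac:

```r
# Approximate standard atomic weights (g/mol); assumed values for illustration.
aw <- c(Al = 26.9815, Na = 22.9898, K = 39.0983, O = 15.999)

# Mass fraction of the element in its oxide: (n * element) / (n * element + m * O)
f_Al <- unname(2 * aw["Al"] / (2 * aw["Al"] + 3 * aw["O"]))  # Al in Al2O3
f_Na <- unname(2 * aw["Na"] / (2 * aw["Na"] + aw["O"]))      # Na in Na2O
f_K  <- unname(2 * aw["K"]  / (2 * aw["K"]  + aw["O"]))      # K in K2O

round(c(Al = f_Al, Na = f_Na, K = f_K), 3)

# Example: convert the first K2O and Al2O3 values of the raw table shown
# earlier (3.25 % and 15.6 %) and take the elemental ratio.
K_pc  <- 3.25 * f_K
Al_pc <- 15.6 * f_Al
K_pc / Al_pc
```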
Steps of a data (geo)science project:
- Load: readr.
- Tidy: tidyr.
- Transform: dplyr.
- Explore: ggplot2.
- Model: ggplot2 and other tools.
Data
Lazy-loaded data: bonenburg and kuhjoch, as well as the long formats bonenburg_long and kuhjoch_long.
Raw data: PAGES_example()
Examples
- project: vignette("project", package = "PAGES")
- explore: vignette("explore", package = "PAGES")
- model: vignette("model", package = "PAGES")
- wrangle: vignette("wrangle", package = "PAGES")
Slides: render_slides()
Dalgaard, Peter. 2008. Introduction to statistics with R. Edited by J Chambers, D Hand, and W. Hardle. Springer. https://doi.org/10.1201/9780429341830-12.
Fox, John, and Sanford Weisberg. 2018. An R companion to applied regression. Sage publications.
Schobben, Martin, Julia Gravendyck, Franziska Mangels, Ulrich Struck, Robert Bussert, Wolfram M. Kürschner, Dieter Korn, P. Martin Sander, and Martin Aberhan. 2019. “A comparative study of total organic carbon-δ13C signatures in the Triassic–Jurassic transitional beds of the Central European Basin and western Tethys shelf seas.” Newsletters on Stratigraphy 52 (4): 461–86. https://doi.org/10.1127/nos/2019/0499.
Wickham, Hadley, and Garrett Grolemund. 2016. R for data science: import, tidy, transform, visualize, and model data. O’Reilly Media, Inc. https://r4ds.had.co.nz/index.html.
Wilkinson, Leland, Graham Wills, D Rope, Andrew Norton, and Roger Dubbs. 2005. The Grammar of Graphics (Statistics and Computing).
Zuur, Alain F., Elena N. Ieno, Neil J. Walker, Anatoly A. Saveliev, and Graham M. Smith. 2008. Mixed Effects Models and Extensions in Ecology with R. https://doi.org/10.4324/9780429201271-2.